ReBNN: Resilient Binary Neural Network
FIGURE 3.29
The evolution of the latent weight distribution of (a) ReActNet and (b) ReBNN. We select the first channel of the first binary convolution layer to show the evolution. The model is initialized from the first-stage training with W32A1 following [158]. We plot the distribution every 32 epochs.
sign flip, thus hindering the training. Inspired by this, we use Eq. (3.150) to calculate γ, which improves performance by 0.6% and shows that accounting for the proportion of weight oscillation permits the necessary sign flips and leads to more effective training. We also show the training loss curves in Fig. 3.30(b). As plotted, the loss curves largely reflect how sufficiently each model is trained. We therefore conclude that ReBNN with γ calculated by Eq. (3.150) achieves the lowest training loss and an efficient training process. Note that the loss may not be minimal at every training iteration; still, our method is a reasonable variant of gradient descent that can be applied to the optimization problem in its general form. We empirically verify ReBNN's capability of mitigating weight oscillation, leading to better convergence.
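To make the role of γ concrete, the following is a minimal sketch of one plausible channel-wise computation. It assumes, as the alternatives in Table 3.7 suggest, that Eq. (3.150) scales the peak latent-weight gradient magnitude max_{1≤j≤M_n}(|∂L/∂ŵ^{n,t}_{i,j}|) by the observed proportion of sign flips; the helper names and the exact combination are illustrative assumptions, not the book's definition.

```python
import torch

def oscillation_ratio(w_prev: torch.Tensor, w_curr: torch.Tensor) -> torch.Tensor:
    """Per-output-channel fraction of latent weights whose binary sign
    flipped between two consecutive iterations (an assumed helper)."""
    flips = (torch.sign(w_prev) != torch.sign(w_curr)).float()
    return flips.flatten(1).mean(dim=1)  # one ratio per output channel

def compute_gamma(w_prev, w_curr, grad):
    """Hypothetical stand-in for Eq. (3.150): damp each channel in
    proportion to how much it oscillates, using the per-channel peak
    gradient magnitude (the gradient-based baseline in Table 3.7) as
    the scale."""
    max_grad = grad.abs().flatten(1).amax(dim=1)         # per-channel max |grad|
    return oscillation_ratio(w_prev, w_curr) * max_grad  # gamma per channel
```

Under this reading, a channel whose weights never flip receives γ = 0 and remains free to move, while a heavily oscillating channel is damped, consistent with the observation that a fixed large γ suppresses even the necessary sign flips.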
Resilient training process: This section shows the evolution of the latent weight distribution. We plot the distribution of the first channel of the first binary convolution layer every 32 epochs in Fig. 3.29. As seen, our ReBNN efficiently redistributes the latent weights toward resilience. The conventional ReActNet [158] possesses a tri-modal distribution, which is unstable due to scaling factors with large magnitudes. In contrast, our ReBNN is constrained by the balanced parameter γ during training, leading to a resilient bi-modal distribution with fewer weights centered around zero. We also plot the ratios of sequential weight oscillation of ReBNN and ReActNet for the 1st, 8th, and 16th binary convolution layers, which can be tracked as sketched below.
TABLE 3.7
We compare different calculation methods of γ, including constants that vary from 0 to 1e−2 and gradient-based calculation.

Value of γ                              Top-1 (%)   Top-5 (%)
0                                       65.8        86.3
1e−5                                    66.2        86.7
1e−4                                    66.4        86.7
1e−3                                    66.3        86.8
1e−2                                    65.9        86.5
max_{1≤j≤M_n}(|∂L/∂ŵ^{n,t}_{i,j}|)      66.3        86.2
Eq. (3.150)                             66.9        87.1